Abstract

The graduated optimization approach, also known as the continuation method, is
a popular heuristic for solving non-convex problems that has received renewed interest
over the last decade. Despite its popularity, very little is known in terms of theoretical
convergence analysis.

In this paper we describe a new first-order algorithm based on graduated optimization
and analyze its performance. We characterize a parameterized family of non-convex
functions for which this algorithm provably converges to a global optimum. In
particular, we prove that the algorithm converges to an $\varepsilon$-approximate solution within
$O(1/\varepsilon^2)$ gradient-based steps. We extend our algorithm and analysis to the setting of
stochastic non-convex optimization with noisy gradient feedback, attaining the same
convergence rate. Additionally, we discuss the setting of “zero-order optimization”,
and devise a variant of our algorithm which converges at a rate of $O(d^2/\varepsilon^4)$.


1 Introduction 

Non-convex optimization programs are ubiquitous in machine learning and computer vision.
Of particular interest are non-convex optimization problems that arise in the training of deep
neural networks Bengio (2009). Often, such problems admit a multimodal structure, and
therefore, the use of convex optimization machinery may lead to a local optimum.

Graduated optimization (a.k.a. continuation), Blake and Zisserman (1987), is a methodology
that attempts to overcome such numerous local optima. At first, a coarse-grained
version of the problem is generated by a local smoothing operation. This coarse-grained
version is easier to solve. Then, the method advances in stages by gradually refining the problem
versions, using the solution of the previous stage as an initial point for the optimization in
the next stage.

Despite its popularity, there are still many gaps concerning both theoretical and practical
aspects of graduated optimization. In particular, we are not aware of a rigorous running
time analysis for finding a global optimum, or even of conditions under which a global optimum is
reached. Nor are we familiar with graduated optimization in the stochastic setting, in which
only a noisy gradient or value oracle to the objective is given. Moreover, any practical
application of graduated optimization requires an efficient construction of coarse-grained
versions of the original function. For some special cases this construction can be made
analytically Chapelle et al. (2006); Chaudhuri and Solar-Lezama (2011). However, in the
general case, it is commonly suggested in the literature to convolve the original function
with a Gaussian kernel Wu (1996). Yet, this operation is prohibitively inefficient in high
dimensions.

In this paper we take an algorithmic / analytic approach to graduated optimization and 
show the following. 

• We characterize a family of non-convex multimodal functions that allows convergence
to a global optimum. We call this parametrized family $\sigma$-nice (see Definition 4.2).

• We provide a stochastic algorithm inspired by graduated optimization that performs
only gradient updates and is ensured to find an $\varepsilon$-optimal solution of $\sigma$-nice functions
within $O(1/\sigma^2\varepsilon^2)$ iterations. The algorithm does not require expensive convolutions and
accesses the smoothed version of any function using random sampling. The algorithm
only requires access to the objective function through a noisy gradient oracle.

• We extend our method to the “zero-order optimization” model (a.k.a. “bandit feedback”
model), in which the objective is only accessible through a noisy value oracle.
We devise a variant of our algorithm that is guaranteed to find an $\varepsilon$-optimal solution
within $O(d^2/\sigma^2\varepsilon^4)$ iterations.

Interestingly, the following question is raised in Bengio (2009), which reviews recent developments
in the field of deep learning: “Can optimization strategies based on continuation
methods deliver significantly improved training of deep architectures?”

As an initial empirical study, we examine the task of training a NN (Neural Network) over
the MNIST data set. Our experiments support the theoretical guarantees, demonstrating
that graduated optimization according to the proposed methodology accelerates convergence
in training the NN. Moreover, we show examples in which $\sigma$-nice functions capture
non-convex structure/phenomena that exist in natural data.

1.1 Related Work 

Within the machine vision community, the idea of graduated optimization has been known since
the 1980s. The term “Graduated Non-Convexity” (GNC) was coined by Blake and Zisserman
(1987), who were the first to establish this idea explicitly. Similar approaches in the machine
vision literature appeared later in Yuille (1989); Yuille et al. (1990), and Terzopoulos (1988).
Concepts of the same nature appeared in the optimization literature Wu (1996), and in
the field of numerical analysis Allgower and Georg (1990).

Over the last two decades, this concept was successfully applied to numerous problems
in computer vision, among them image deblurring Boccuto et al. (2002), image restoration
Nikolova et al. (2010), and optical flow Brox and Malik (2011). The method was also adopted
by the machine learning community, demonstrating effective performance in tasks such as
semi-supervised learning Chapelle et al. (2006), graph matching Zaslavskiy et al. (2009),
and ranking Chapelle and Wu (2010). In Bengio (2009), it is suggested to consider some
developments in deep belief architectures Hinton et al. (2006); Erhan et al. (2009) as a kind of
continuation. These approaches, in the spirit of the continuation method, offer no guarantees
on the quality of the obtained solution, and are tailored to specific applications.

A comprehensive survey of the graduated optimization literature can be found in Mobahi 
and Fisher III (2015a). 

A recent work Mobahi and Fisher III (2015b) advances our theoretical understanding, 
by analyzing a continuation algorithm in the general setting. Yet, they offer no way to 
perform the smoothing efficiently, nor a way to optimize the smoothed versions; but rather 
assume that these are possible. Moreover, their guarantee is limited to a fixed precision that 
depends on the objective function and does not approach zero. In contrast, our approach 
can generate arbitrarily precise solutions. 


2 Setting 


We discuss the optimization of a non-convex loss function $f : \mathcal{K} \to \mathbb{R}$, where $\mathcal{K} \subseteq \mathbb{R}^d$ is a
convex set. We assume that optimization lasts for $T$ rounds; on each round $t = 1, \ldots, T$, we
may query a point $x_t \in \mathcal{K}$ and receive feedback. After the last round, we choose $\bar{x}_T \in \mathcal{K}$,
and our performance measure is the excess loss, defined as:

$$f(\bar{x}_T) - \min_{x \in \mathcal{K}} f(x)$$


In Section 4.2 we characterize a family of non-convex multimodal functions we call $\sigma$-nice.
Given such a $\sigma$-nice loss $f$, we are interested in algorithms that with high probability
ensure an $\varepsilon$-excess loss within $\mathrm{poly}(1/\varepsilon)$ rounds.

We consider two kinds of feedback: 


1. Noisy Gradient feedback: Upon querying $x_t$ we receive $\nabla f(x_t) + \xi_t$, where $\{\xi_t\}_{t=1}^{T}$
are independent zero mean and bounded r.v.’s.

2. Noisy Value feedback (Bandit feedback): Upon querying $x_t$ we receive $f(x_t) + \xi_t$,
where $\{\xi_t\}_{t=1}^{T}$ are independent zero mean and bounded r.v.’s.


3 Preliminaries and Notation 

Notation: Throughout this paper we use $\mathbb{B}$, $\mathbb{S}$ to denote the unit Euclidean ball/sphere in $\mathbb{R}^d$,
and $\mathbb{B}_r(x)$, $\mathbb{S}_r(x)$ to denote the Euclidean $r$-ball/sphere in $\mathbb{R}^d$ centered at $x$. For a set
$A \subseteq \mathbb{R}^d$, $u \sim A$ denotes a random variable distributed uniformly over $A$.


3.1 Strong-Convexity 

Recall the definition of strongly-convex functions: 

Definition 3.1. (Strong Convexity) We say that a function $F : \mathbb{R}^n \to \mathbb{R}$ is $\sigma$-strongly
convex over the set $\mathcal{K}$ if for all $x, y \in \mathcal{K}$ it holds that

$$F(y) \ge F(x) + \nabla F(x)^\top (y - x) + \frac{\sigma}{2}\|x - y\|^2$$

Let $F$ be a $\sigma$-strongly convex function over a convex set $\mathcal{K}$, and let $x^*$ be a point in $\mathcal{K}$ where $F$ is
minimized; then the following inequality is satisfied:

$$\frac{\sigma}{2}\|x - x^*\|^2 \le F(x) - F(x^*) \qquad (1)$$

This is immediate by the definition of strong convexity combined with $\nabla F(x^*)^\top (x - x^*) \ge 0$
for all $x \in \mathcal{K}$.
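
As a quick numerical illustration of inequality (1), the following sketch checks it at random points for a small test function. The function `F`, the constant `SIGMA` and the minimizer `X_STAR` are hypothetical choices made for this example only; they do not come from the paper.

```python
import random

SIGMA = 2.0
X_STAR = (1.0, -1.0)  # minimizer of the hypothetical test function F below


def F(x):
    """SIGMA-strongly convex test function: a quadratic plus a convex quartic term."""
    quad = 0.5 * SIGMA * sum((xi - ci) ** 2 for xi, ci in zip(x, X_STAR))
    return quad + (x[0] - X_STAR[0]) ** 4 + 1.0


def inequality_1_holds(x):
    """Check (SIGMA/2) * ||x - x*||^2 <= F(x) - F(x*) at a single point x."""
    lhs = 0.5 * SIGMA * sum((xi - ci) ** 2 for xi, ci in zip(x, X_STAR))
    return lhs <= F(x) - F(X_STAR) + 1e-9  # small slack for rounding


rng = random.Random(0)
points = [(rng.uniform(-5, 5), rng.uniform(-5, 5)) for _ in range(1000)]
all_hold = all(inequality_1_holds(x) for x in points)
```

Inequality (1) is what later turns a bound on the excess loss of a smoothed version into a bound on the distance to its minimizer.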

4 Smoothing and $\sigma$-Nice Functions

Constructing finer and finer approximations to the original objective function is at the heart
of the continuation approach. In Section 4.1 we define the smoothed versions that we will
employ. Next, in Section 4.1.1 we describe an efficient way to implicitly access the smoothed
versions, which will enable us to perform optimization. Finally, in Section 4.2 we define a
class of non-convex multimodal functions that we call $\sigma$-nice. As we will see in Section 7,
these functions are rich enough to capture non-convex structure that exists in natural data.
Additionally, these functions lend themselves to efficient optimization, and we can ensure
convergence to an $\varepsilon$-solution within $\mathrm{poly}(1/\varepsilon)$ iterations, as described in Sections 5 and 6.

4.1 Smoothing 

Smoothing by local averaging is formally defined next. 

Definition 4.1. Given an $L$-Lipschitz function $f : \mathbb{R}^d \to \mathbb{R}$, define its $\delta$-smooth version to
be

$$\hat{f}_\delta(x) = \mathbb{E}_{u \sim \mathbb{B}}[f(x + \delta u)].$$

The next lemma bounds the bias between $\hat{f}_\delta$ and $f$.

Lemma 4.1. Let $\hat{f}_\delta$ be the $\delta$-smoothed version of $f$; then

$$\forall x \in \mathbb{R}^d: \quad |\hat{f}_\delta(x) - f(x)| \le \delta L$$

Oracle 1: SGO_G

Input: $x \in \mathbb{R}^d$, smoothing parameter $\delta$
Return: $\nabla f(x + \delta u)$, where $u \sim \mathbb{B}$


Figure 1: Smoothed gradient oracle given gradient feedback.


Oracle 2: SGO_V

Input: $x \in \mathbb{R}^d$, smoothing parameter $\delta$
Return: $\frac{d}{\delta} f(x + \delta v)\, v$, where $v \sim \mathbb{S}$


Figure 2: Smoothed gradient oracle given value feedback.

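Proof of Lemma 4.1.

$$|\hat{f}_\delta(x) - f(x)| = |\mathbb{E}_{u \sim \mathbb{B}}[f(x + \delta u)] - f(x)|$$
$$\le \mathbb{E}_{u \sim \mathbb{B}}[|f(x + \delta u) - f(x)|]$$
$$\le \mathbb{E}_{u \sim \mathbb{B}}[L\|\delta u\|]$$
$$\le L\delta$$

In the first inequality we used Jensen’s inequality, in the second the Lipschitzness of $f$, and in the
last inequality we used $\|u\| \le 1$, since $u \in \mathbb{B}$. □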
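
The smoothing of Definition 4.1 and the bias bound of Lemma 4.1 can be illustrated with a small Monte Carlo sketch. The helper names (`sample_unit_ball`, `smoothed_value`) and the test function `f_norm` are assumptions made for this illustration; the estimate is only approximate because it uses a finite number of samples.

```python
import math
import random


def sample_unit_ball(d, rng):
    """Draw u uniformly from the unit Euclidean ball in R^d."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    radius = rng.random() ** (1.0 / d)  # radius with density proportional to r^(d-1)
    return [radius * v / norm for v in g]


def smoothed_value(f, x, delta, n_samples=20000, seed=0):
    """Monte Carlo estimate of E_{u ~ B}[f(x + delta * u)]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        u = sample_unit_ball(len(x), rng)
        total += f([xi + delta * ui for xi, ui in zip(x, u)])
    return total / n_samples


def f_norm(x):
    """A 1-Lipschitz test function (L = 1): the Euclidean norm."""
    return math.sqrt(sum(v * v for v in x))


x0, delta = [0.0, 0.0], 0.5
estimate = smoothed_value(f_norm, x0, delta)
bias = abs(estimate - f_norm(x0))  # Lemma 4.1 bounds this by delta * L = 0.5
```

For this particular choice the smoothed value at the origin is $\delta \cdot d/(d+1) = 1/3$ in closed form, which sits below the bound $\delta L = 0.5$.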

4.1.1 Implicit Smoothing using Sampling 

A direct way to optimize a smoothed version is by direct calculation of its gradients; nevertheless,
this calculation might be very costly in high dimensions. A much more efficient approach
is to produce an unbiased estimate for the gradients of the smoothed version by sampling the
function gradients/values. These estimates could then be used by a stochastic optimization
algorithm such as SGD (Stochastic Gradient Descent). This sampling approach is outlined
in Figures 1 and 2.

The following two Lemmas state that the resulting estimates are unbiased and bounded: 

Lemma 4.2. Let $x \in \mathbb{R}^d$, $\delta > 0$, and suppose that $f$ is $L$-Lipschitz; then the output of SGO_G
(Figure 1) is bounded by $L$ and is an unbiased estimate for $\nabla \hat{f}_\delta(x)$.

Proof. SGO_G outputs $\nabla f(x + \delta u)$ for some $u \in \mathbb{B}$, so the first part is immediate by the
Lipschitzness of $f$. Now, by definition, $\hat{f}_\delta(x) = \mathbb{E}_{u \sim \mathbb{B}}[f(x + \delta u)]$; differentiating both sides we get
the second part of the lemma. □

Lemma 4.3. Let $x \in \mathcal{K} \subseteq \mathbb{R}^d$, $\delta > 0$, and suppose that $\max_x |f(x)| \le C$; then the output of
SGO_V (Figure 2) is bounded by $dC/\delta$ and is an unbiased estimate for $\nabla \hat{f}_\delta(x)$.

Proof. SGO_V outputs $\frac{d}{\delta} f(x + \delta v)\, v$ for some $v \in \mathbb{S}$; since $f$ is $C$-bounded over $\mathcal{K}$, the first part
of the lemma is immediate. In order to prove the second part, we can use Stokes’ theorem
to show that if $v \sim \mathbb{S}$ then:

$$\forall x \in \mathbb{R}^d: \quad \mathbb{E}_{v \sim \mathbb{S}}[f(x + \delta v)\, v] = \frac{\delta}{d} \nabla \hat{f}_\delta(x) \qquad (2)$$

A proof of Equation (2) is found in Flaxman et al. (2005). □

Note that the oracles depicted in Figures 1 and 2 may require sampling function values
outside $\mathcal{K}$ (specifically in $\mathcal{K} + \delta\mathbb{B}$). We assume that this is possible, and that the bounds
over the function gradients/values inside $\mathcal{K}$ also apply in $\mathcal{K} + \delta\mathbb{B}$.

Extensions to the noisy feedback settings: Note that for ease of notation, the oracles
that appear in Figures 1 and 2 assume we can access exact gradients/values of $f$. Given that
we may only access noisy and bounded gradient/value estimates of $f$ (Sec. 2), we could use
these instead of the exact ones that appear in Figures 1 and 2, and still produce unbiased and
bounded gradient estimates for the smoothed versions of $f$ as shown in Lemmas 4.2 and 4.3.

In particular, if we may only access noisy gradients of $f$, then SGO_G (Figure 1)
will return $\nabla f(x + \delta u) + \xi$ instead of $\nabla f(x + \delta u)$, where $\xi$ is a noise term. Since we assume
zero bias and bounded noise, this implies that $\nabla f(x + \delta u) + \xi$ is an unbiased estimate of
$\nabla \hat{f}_\delta(x)$, bounded by $L + K$, where $K$ is the bound on the noise and $L$ is the Lipschitz constant
of $f$. We can show the same for SGO_V (Figure 2), given noisy value feedback.
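
The two oracles, including the noisy-feedback variants, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names (`sgo_g`, `sgo_v`, `average`) and the uniform noise model are choices made here, and the value oracle uses the scaling $d/\delta$ implied by Equation (2).

```python
import math
import random


def sample_unit_sphere(d, rng):
    """Draw v uniformly from the unit Euclidean sphere in R^d."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]


def sample_unit_ball(d, rng):
    """Draw u uniformly from the unit Euclidean ball in R^d."""
    radius = rng.random() ** (1.0 / d)
    return [radius * v for v in sample_unit_sphere(d, rng)]


def sgo_g(grad_f, x, delta, rng, noise=0.0):
    """Oracle 1: (noisy) gradient of f at a uniformly random point of B_delta(x)."""
    u = sample_unit_ball(len(x), rng)
    g = grad_f([xi + delta * ui for xi, ui in zip(x, u)])
    return [gi + rng.uniform(-noise, noise) for gi in g]


def sgo_v(f, x, delta, rng, noise=0.0):
    """Oracle 2: (d / delta) * (noisy value of f at x + delta * v) * v, v on the sphere."""
    d = len(x)
    v = sample_unit_sphere(d, rng)
    value = f([xi + delta * vi for xi, vi in zip(x, v)]) + rng.uniform(-noise, noise)
    return [(d / delta) * value * vi for vi in v]


def average(oracle, n, d):
    """Average n independent oracle outputs (empirical mean of the estimates)."""
    total = [0.0] * d
    for _ in range(n):
        total = [t + gi for t, gi in zip(total, oracle())]
    return [t / n for t in total]
```

For the quadratic $f(x) = \frac{1}{2}\|x\|^2$ the gradient of the smoothed version equals $x$, so averaging many outputs of either oracle at a point should approximately recover that point. The value oracle typically needs more samples because its outputs scale with $d/\delta$.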


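4.2 $\sigma$-Nice Functions


Following is our main definition:


Definition 4.2. A function $f : \mathcal{K} \to \mathbb{R}$ is said to be $\sigma$-nice if the following two conditions
hold:

1. Centering property: For every $\delta > 0$, and every $x^*_\delta \in \arg\min_{x \in \mathcal{K}} \hat{f}_\delta(x)$, there exists
$x^*_{\delta/2} \in \arg\min_{x \in \mathcal{K}} \hat{f}_{\delta/2}(x)$, such that:

$$\|x^*_\delta - x^*_{\delta/2}\| \le \frac{\delta}{2}$$

2. Local strong convexity of the smoothed function: For every $\delta > 0$ let $r_\delta = 3\delta$,
and denote $x^*_\delta = \arg\min_{x \in \mathcal{K}} \hat{f}_\delta(x)$; then over $\mathbb{B}_{r_\delta}(x^*_\delta)$, the function $\hat{f}_\delta(x)$ is $\sigma$-strongly
convex.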


Hence, $\sigma$-niceness is a combination of two properties. Together they imply that optimizing
the smoothed version on a scale $\delta$ is a good start for optimizing a finer version on a scale of
$\delta/2$, which is sufficient for a scheme based on graduated optimization to work, as we show
next. In Section 7 we show that $\sigma$-nice functions arise naturally in data. An illustration of
a $\sigma$-nice function in one dimension appears in Figure 3.

Figure 3: A 1-dim $\sigma$-nice function ($\delta = 0$), and its smoothed versions ($\delta = 1$, $\delta = 0.5$, $\delta = 0.25$).

5 Graduated Optimization with a Gradient Oracle 

In this section we assume that we can access a noisy gradient oracle for $f$.

Thus, given $x \in \mathbb{R}^d$, $\delta > 0$ we can use SGO_G (Figure 1) to obtain an unbiased and
bounded estimate for $\nabla \hat{f}_\delta(x)$, as ensured by Lemma 4.2. Note that for ease of notation
SGO_G (Figure 1) is listed using an exact gradient oracle for $f$. As described at the end of
Section 4.1.1, this could be replaced with a noisy gradient oracle for $f$, and Lemma 4.2 will
still hold.

Following is our main Theorem: 

Theorem 5.1. Let $\varepsilon \in (0,1)$ and $p \in (0, 1/e)$; also let $\mathcal{K}$ be a convex set, and $f$ be an
$L$-Lipschitz $\sigma$-nice function. Suppose that we apply Algorithm 1; then after $O(1/\sigma^2\varepsilon^2)$ rounds
Algorithm 1 outputs a point $\bar{x}_{M+1}$ which is $\varepsilon$-optimal with a probability greater than $1 - p$.

Algorithm 1 is divided into epochs. At epoch $m$ it uses SGO_G to obtain unbiased estimates
for the gradients of $\hat{f}_{\delta_m}$, which are then employed by Suffix-SGD (Algorithm 2) to optimize
this smoothed version. This optimization over $\hat{f}_{\delta_m}$ is performed until we are ensured to reach
a point close enough to $x^*_{m+1} := \arg\min_{x \in \mathcal{K}} \hat{f}_{\delta_{m+1}}(x)$, i.e., the minimum of $\hat{f}_{\delta_{m+1}}$. Also note
that at epoch $m$ the optimization over $\hat{f}_{\delta_m}$ is initialized at $\bar{x}_m$, which is the point reached at
the previous epoch.

Suffix-SGD (Algorithm 2) is a stochastic optimization algorithm for strongly convex
functions. Its guarantees are presented in Section 5.1.

5.1 Analysis 

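Let us first discuss Suffix-SGD (Algorithm 2). This algorithm performs projected gradient
descent using the gradients received from GradOracle($\cdot$). The projection operator $\Pi_{\mathcal{K}}$ is
defined for all $y \in \mathbb{R}^d$ as

$$\Pi_{\mathcal{K}}(y) := \arg\min_{x \in \mathcal{K}} \|x - y\|.$$

Algorithm 1 GradOpt_G

Input: target error $\varepsilon$, maximal failure probability $p$, decision set $\mathcal{K}$
Choose $\bar{x}_1 \in \mathcal{K}$ uniformly at random.
Set $\delta_1 = \mathrm{diam}(\mathcal{K})/2$, $\tilde{p} = p/M$, and $M = \log_2 \frac{1}{\alpha_0 \varepsilon}$, where $\alpha_0 = \min\left\{\frac{1}{2L\,\mathrm{diam}(\mathcal{K})}, \frac{2\sqrt{2}}{\sqrt{\sigma}\,\mathrm{diam}(\mathcal{K})}\right\}$
for $m = 1$ to $M$ do
    // Perform SGD over $\hat{f}_{\delta_m}$
    Set $\varepsilon_m := \sigma\delta_m^2/32$, and

    $$T_F = \frac{12480 L^2}{\sigma \varepsilon_m} \log\left(\frac{2}{\tilde{p}} + 2\log\frac{12480 L^2}{\sigma \varepsilon_m}\right)$$

    Set the shrunken decision set,

    $$\mathcal{K}_m := \mathcal{K} \cap \mathbb{B}(\bar{x}_m, 1.5\delta_m)$$

    Set the gradient oracle for $\hat{f}_{\delta_m}$,

    GradOracle($\cdot$) = SGO_G($\cdot$, $\delta_m$)

    Update:

    $\bar{x}_{m+1} \leftarrow$ Suffix-SGD($T_F$, $\mathcal{K}_m$, $\bar{x}_m$, GradOracle)

    $\delta_{m+1} = \delta_m/2$
end for
Return: $\bar{x}_{M+1}$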
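
The epoch structure of GradOpt_G can be sketched in one dimension, where the ball is the interval $[-1, 1]$ and projecting onto the shrunken decision set is a simple clip. This is a simplified illustration rather than the analyzed algorithm: it uses a fixed number of epochs and a fixed number of steps per epoch instead of $M$ and $T_F$, it takes the starting point as an argument instead of drawing it at random, and the objective `f` (a quadratic with a narrow spurious well near $x = 2$) is a hypothetical test function that is not claimed to be $\sigma$-nice.

```python
import math
import random


def grad_opt_1d(grad_f, lo, hi, x1, sigma, epochs, steps_per_epoch, seed=0):
    """Graduated optimization on K = [lo, hi] with a smoothed gradient oracle."""
    rng = random.Random(seed)
    delta = (hi - lo) / 2.0  # delta_1 = diam(K) / 2
    x_bar = x1
    for _ in range(epochs):
        # Shrunken decision set: K intersected with B(x_bar, 1.5 * delta), an interval.
        a = max(lo, x_bar - 1.5 * delta)
        b = min(hi, x_bar + 1.5 * delta)
        x, half, suffix_sum = x_bar, steps_per_epoch // 2, 0.0
        for t in range(1, steps_per_epoch + 1):
            if t > half:
                suffix_sum += x  # accumulate the last half of the iterates
            g = grad_f(x + delta * rng.uniform(-1.0, 1.0))  # SGO_G in 1-D
            x = min(b, max(a, x - g / (sigma * t)))  # projected step, eta_t = 1/(sigma*t)
        x_bar = suffix_sum / (steps_per_epoch - half)
        delta /= 2.0  # refine the smoothing scale
    return x_bar


W = 0.1  # width of the spurious well (hypothetical test objective)


def f(x):
    return 0.5 * x * x - math.exp(-((x - 2.0) ** 2) / (2 * W * W))


def grad_f(x):
    return x + (x - 2.0) / (W * W) * math.exp(-((x - 2.0) ** 2) / (2 * W * W))


x_hat = grad_opt_1d(grad_f, -4.0, 4.0, x1=3.0, sigma=1.0, epochs=6, steps_per_epoch=2000)
```

In this example the coarse smoothing scales wash out the narrow well, so the early epochs move the iterate toward the global minimizer near $0$; the later epochs refine it inside increasingly small intervals.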


Algorithm 2 Suffix-SGD

Input: total time $T_F$, decision set $\mathcal{K}$, initial point $x_1 \in \mathcal{K}$, gradient oracle GradOracle($\cdot$)
for $t = 1$ to $T_F$ do
    Set $\eta_t = 1/\sigma t$

    Query the gradient oracle at $x_t$:

    $g_t \leftarrow$ GradOracle($x_t$)

    Update: $x_{t+1} \leftarrow \Pi_{\mathcal{K}}(x_t - \eta_t g_t)$

end for

Return: $\bar{x}_{T_F} := \frac{2}{T_F}(x_{T_F/2+1} + \ldots + x_{T_F})$
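
A minimal sketch of Suffix-SGD follows. The function names and the choice of a Euclidean ball as the decision set in `project_ball` are assumptions made for illustration; any exact projection onto a convex set could be passed as `project`.

```python
import math
import random


def project_ball(y, center, radius):
    """Euclidean projection of y onto the ball B(center, radius)."""
    diff = [yi - ci for yi, ci in zip(y, center)]
    norm = math.sqrt(sum(v * v for v in diff))
    if norm <= radius:
        return list(y)
    return [ci + radius * v / norm for ci, v in zip(center, diff)]


def suffix_sgd(T, project, x1, grad_oracle, sigma):
    """Projected SGD with step 1/(sigma*t); returns the average of the last half of the iterates."""
    x = list(x1)
    half = T // 2
    suffix_sum = [0.0] * len(x)
    for t in range(1, T + 1):
        if t > half:
            # x currently holds x_t, so this accumulates x_{half+1}, ..., x_T.
            suffix_sum = [s + xi for s, xi in zip(suffix_sum, x)]
        g = grad_oracle(x)
        eta = 1.0 / (sigma * t)
        x = project([xi - eta * gi for xi, gi in zip(x, g)])
    return [s / (T - half) for s in suffix_sum]
```

Averaging only the suffix of the trajectory discards the early iterates, which are taken with large step sizes and tend to be far from the minimizer.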


Now consider a $\sigma$-strongly convex function $F : \mathcal{K} \to \mathbb{R}$, and suppose that we have an
oracle, GradOracle($\cdot$), that upon querying a point $x \in \mathcal{K}$ returns an unbiased and bounded
gradient estimate $g$, i.e., $\mathbb{E}[g] = \nabla F(x)$ and $\|g\| \le G$. Note the following result from
Rakhlin et al. (2011) regarding stochastic optimization of $\sigma$-strongly-convex functions, given
such an oracle:

Theorem 5.2. Let $p \in (0, 1/e)$, and let $F$ be a $\sigma$-strongly convex function. Suppose that
GradOracle($\cdot$) produces $G$-bounded and unbiased estimates of $\nabla F$. Then after no more than
$T_F$ rounds, the final point $\bar{x}_{T_F}$ returned by Suffix-SGD (Algorithm 2) ensures that with a
probability $\ge 1 - p$, we have:

$$F(\bar{x}_{T_F}) - \min_{x \in \mathcal{K}} F(x) \le \frac{6240 \log(2\log(T_F)/p)\, G^2}{\sigma T_F}$$

Corollary 5.1. The latter means that for $T_F \ge \frac{12480 G^2}{\sigma\varepsilon} \log\left(2/p + 2\log(12480 G^2/\sigma\varepsilon)\right)$ we
will have an excess loss smaller than $\varepsilon$.

Notice that at each epoch $m$ of GradOpt_G, it initiates Suffix-SGD with a gradient oracle
SGO_G($\cdot$, $\delta_m$). According to Lemma 4.2, SGO_G($\cdot$, $\delta_m$) produces unbiased and $L$-bounded
estimates of the gradients of $\hat{f}_{\delta_m}$; thus in the analysis of each epoch we can use Theorem 5.2 for $\hat{f}_{\delta_m}$, taking
$G = L$.
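
To get a feel for the number of rounds prescribed by Corollary 5.1, the bound can be evaluated directly. The function name `suffix_sgd_budget` and the numeric inputs below are illustrative assumptions; the formula itself is the one stated in the corollary.

```python
import math


def suffix_sgd_budget(G, sigma, eps, p):
    """Rounds sufficient for excess loss eps w.p. >= 1 - p, per Corollary 5.1."""
    a = 12480.0 * G ** 2 / (sigma * eps)
    return math.ceil(a * math.log(2.0 / p + 2.0 * math.log(a)))


# Hypothetical constants: G = sigma = 1, failure probability 0.1.
budget_coarse = suffix_sgd_budget(G=1.0, sigma=1.0, eps=0.01, p=0.1)
budget_fine = suffix_sgd_budget(G=1.0, sigma=1.0, eps=0.0025, p=0.1)
ratio = budget_fine / budget_coarse
```

Dividing the target error by four multiplies the budget by slightly more than four, the excess coming from the logarithmic factor. Since each epoch of GradOpt_G targets an error proportional to $\delta_m^2$, halving $\delta_m$ has this effect on the epoch length.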

Following is our key Lemma: 


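Lemma 5.1. Consider $M$, $\mathcal{K}_m$ and $\bar{x}_{m+1}$ as defined in Algorithm 1. Also denote by $x^*_m$ the
minimizer of $\hat{f}_{\delta_m}$ in $\mathcal{K}$. Then the following holds for all $1 \le m \le M$ w.p. $\ge 1 - p$:

1. The smoothed version $\hat{f}_{\delta_m}$ is $\sigma$-strongly convex over $\mathcal{K}_m$, and $x^*_m \in \mathcal{K}_m$.

2. Also, $\hat{f}_{\delta_m}(\bar{x}_{m+1}) - \hat{f}_{\delta_m}(x^*_m) \le \sigma\delta_{m+1}^2/8$.

Proof. We prove the lemma by induction, starting with $m = 1$. Note that $\delta_1 = \mathrm{diam}(\mathcal{K})/2$,
therefore $\mathcal{K}_1 = \mathcal{K}$, and also $x^*_1 \in \mathcal{K}_1$. Also recall that $\sigma$-niceness of $f$ implies that $\hat{f}_{\delta_1}$ is
$\sigma$-strongly convex in $\mathcal{K}$; thus by Corollary 5.1, after no more than $T_F$ optimization
steps of Suffix-SGD (with $T_F$ as set in Algorithm 1 for $\varepsilon_1 = \sigma\delta_1^2/32$), with a probability greater than $1 - p/M$, we will have:

$$\hat{f}_{\delta_1}(\bar{x}_2) - \hat{f}_{\delta_1}(x^*_1) \le \sigma\delta_1^2/32 = \sigma\delta_2^2/8$$

which establishes the case of $m = 1$. Now assume that the lemma holds for some $m \ge 1$. By this
assumption, $\hat{f}_{\delta_m}(\bar{x}_{m+1}) - \hat{f}_{\delta_m}(x^*_m) \le \sigma\delta_{m+1}^2/8$, $\hat{f}_{\delta_m}$ is $\sigma$-strongly convex in $\mathcal{K}_m$, and also
$x^*_m \in \mathcal{K}_m$. Hence, we can use Equation (1) to get:

$$\|\bar{x}_{m+1} - x^*_m\| \le \sqrt{\frac{2}{\sigma}\left(\hat{f}_{\delta_m}(\bar{x}_{m+1}) - \hat{f}_{\delta_m}(x^*_m)\right)} \le \frac{\delta_{m+1}}{2}$$

Combining the latter with the centering property of $\sigma$-niceness yields:

$$\|\bar{x}_{m+1} - x^*_{m+1}\| \le \|\bar{x}_{m+1} - x^*_m\| + \|x^*_m - x^*_{m+1}\| \le 1.5\,\delta_{m+1}$$

and it follows that

$$x^*_{m+1} \in \mathbb{B}(\bar{x}_{m+1}, 1.5\delta_{m+1}) \subset \mathbb{B}(x^*_{m+1}, 3\delta_{m+1})$$

Recalling that $\mathcal{K}_{m+1} := \mathcal{K} \cap \mathbb{B}(\bar{x}_{m+1}, 1.5\delta_{m+1})$, and the local strong convexity property of $f$
(which is $\sigma$-nice), the induction step for the first part of the lemma holds. Now, by Corollary 5.1,
after no more than $T_F$ optimization steps of Suffix-SGD over $\mathcal{K}_{m+1}$ (with $T_F$ as set in Algorithm 1 for $\varepsilon_{m+1}$), we
will have:

$$\hat{f}_{\delta_{m+1}}(\bar{x}_{m+2}) - \hat{f}_{\delta_{m+1}}(x^*_{m+1}) \le \sigma\delta_{m+2}^2/8$$

which establishes the induction step for the second part of the lemma.

An analysis of the failure probability: since we have $M$ epochs in total and at each epoch the failure
probability is smaller than $p/M$, the total failure probability of our algorithm is smaller
than $p$. □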

We are now ready to prove Theorem 5.1: 

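Proof of Theorem 5.1. Algorithm 1 terminates after $M = \log_2 \frac{1}{\alpha_0\varepsilon}$ epochs, meaning $\delta_M =
\mathrm{diam}(\mathcal{K})\alpha_0\varepsilon/2$. According to Lemma 5.1 the following holds w.p. $\ge 1 - p$, for every $x \in \mathcal{K}$:

$$\hat{f}_{\delta_M}(\bar{x}_{M+1}) - \hat{f}_{\delta_M}(x) \le \sigma\delta_{M+1}^2/8 = \left(\frac{\sqrt{\sigma}\,\mathrm{diam}(\mathcal{K})\,\alpha_0\varepsilon}{8\sqrt{2}}\right)^2$$

Due to Lemma 4.1, $\hat{f}_{\delta_M}$ is $L\delta_M$-biased from $f$; thus for every $x \in \mathcal{K}$,

$$f(\bar{x}_{M+1}) - f(x) \le L\,\mathrm{diam}(\mathcal{K})\,\alpha_0\varepsilon + \left(\frac{\sqrt{\sigma}\,\mathrm{diam}(\mathcal{K})\,\alpha_0\varepsilon}{8\sqrt{2}}\right)^2 \le \varepsilon$$

where we used $\alpha_0 = \min\left\{\frac{1}{2L\,\mathrm{diam}(\mathcal{K})}, \frac{2\sqrt{2}}{\sqrt{\sigma}\,\mathrm{diam}(\mathcal{K})}\right\}$ and $\varepsilon < 1$.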

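Let $T_{\mathrm{total}}$ be the total number of queries made by Algorithm 1; then we have:

$$T_{\mathrm{total}} \le \sum_{m=1}^{M} \frac{12480 L^2}{\sigma\varepsilon_m} \log\Gamma$$
$$= \sum_{m=1}^{M} \frac{12480 L^2}{\sigma(\sigma\delta_m^2/32)} \log\Gamma$$
$$\le \frac{4 \cdot 10^5 L^2 \log\Gamma}{\sigma^2\delta_1^2} \sum_{i=1}^{M} 4^{i-1}$$
$$\le \frac{14 \cdot 10^4 L^2 \log\Gamma}{\sigma^2\delta_1^2}\, 4^M$$
$$\le \frac{14 \cdot 10^4 L^2 \log\Gamma}{\sigma^2} \max\{16L^2, \sigma/2\}\, \frac{1}{\varepsilon^2}$$

Here we used the notation:

$$\Gamma := \frac{2}{\tilde{p}} + 2\log(12480 L^2/\sigma\varepsilon_M)$$
$$\le \frac{2M}{p} + 2\log\left(4 \cdot 10^5 L^2 \max\{16L^2, \tfrac{\sigma}{2}\}/\sigma^2\varepsilon^2\right)$$

□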


6 Graduated Optimization with a Value Oracle 

In this section we assume that we can access a noisy value oracle for $f$.

Thus, given $x \in \mathbb{R}^d$, $\delta > 0$ we can use SGO_V (Figure 2) as an oracle that outputs
unbiased and bounded estimates for $\nabla \hat{f}_\delta(x)$, as ensured by Lemma 4.3. Note that for ease
of notation SGO_V (Figure 2) is listed using an exact value oracle for $f$. As described at the
end of Section 4.1.1, this could be replaced with a noisy value oracle for $f$, and Lemma 4.3
will still hold.

Following is our main Theorem:

Theorem 6.1. Let $\varepsilon > 0$ and $p \in (0, 1/e)$; also let $\mathcal{K}$ be a convex set, and $f$ be an $L$-Lipschitz
$\sigma$-nice function. Assume also that $\max_x |f(x)| \le C$. Suppose that we apply Algorithm 3;
then after $O(d^2/\sigma^2\varepsilon^4)$ rounds Algorithm 3 outputs a point $\bar{x}_{M+1}$ which is $\varepsilon$-optimal with
a probability greater than $1 - p$.


6.1 Analysis 

Notice that at each epoch $m$ of GradOpt_V, it initiates Suffix-SGD with a gradient oracle
SGO_V($\cdot$, $\delta_m$). According to Lemma 4.3, SGO_V($\cdot$, $\delta_m$) produces unbiased and $dC/\delta_m$-bounded
estimates for the gradients of $\hat{f}_{\delta_m}$; thus in the analysis of each epoch we can use
Corollary 5.1 for $\hat{f}_{\delta_m}$, taking $G = dC/\delta_m$.

Following is our key Lemma:

Lemma 6.1. Consider $M$, $\mathcal{K}_m$ and $\bar{x}_{m+1}$ as defined in Algorithm 3. Also denote by $x^*_m$ the
minimizer of $\hat{f}_{\delta_m}$ in $\mathcal{K}$. Then the following holds for all $1 \le m \le M$ w.p. $\ge 1 - p$:

1. The smoothed version $\hat{f}_{\delta_m}$ is $\sigma$-strongly convex over $\mathcal{K}_m$, and $x^*_m \in \mathcal{K}_m$.

2. Also, $\hat{f}_{\delta_m}(\bar{x}_{m+1}) - \hat{f}_{\delta_m}(x^*_m) \le \sigma\delta_{m+1}^2/8$.

The proof of Lemma 6.1 is similar to the proof of Lemma 5.1 given in Section 5.1; we
therefore omit the details.
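
The effect of the larger gradient bound $G = dC/\delta_m$ on the per-epoch budget can be seen by evaluating Corollary 5.1 with the target $\varepsilon_m = \sigma\delta_m^2/32$ for both oracle types. The function names and the numeric constants below are assumptions chosen for illustration.

```python
import math


def epoch_budget(G, sigma, delta, p_tilde):
    """Rounds per Corollary 5.1 with target error eps_m = sigma * delta^2 / 32."""
    eps_m = sigma * delta ** 2 / 32.0
    a = 12480.0 * G ** 2 / (sigma * eps_m)
    return a * math.log(2.0 / p_tilde + 2.0 * math.log(a))


def gradient_oracle_budget(L, sigma, delta, p_tilde):
    """Epoch budget with the gradient oracle, where estimates are L-bounded."""
    return epoch_budget(L, sigma, delta, p_tilde)


def value_oracle_budget(d, C, sigma, delta, p_tilde):
    """Epoch budget with the value oracle, where estimates are (d*C/delta)-bounded."""
    return epoch_budget(d * C / delta, sigma, delta, p_tilde)


# Growth of the budget when the smoothing scale is halved (hypothetical constants).
growth_gradient = gradient_oracle_budget(1.0, 1.0, 0.5, 0.01) / gradient_oracle_budget(1.0, 1.0, 1.0, 0.01)
growth_value = value_oracle_budget(10, 1.0, 1.0, 0.5, 0.01) / value_oracle_budget(10, 1.0, 1.0, 1.0, 0.01)
```

Halving $\delta_m$ multiplies the budget by roughly 4 with gradient feedback and by roughly 16 with value feedback, which is the source of the $1/\varepsilon^4$ dependence in Theorem 6.1 compared with $1/\varepsilon^2$ in Theorem 5.1.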

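We are now ready to prove Theorem 6.1:

Algorithm 3 GradOpt_V

Input: target error $\varepsilon$, maximal failure probability $p$, decision set $\mathcal{K}$
Choose $\bar{x}_1 \in \mathcal{K}$ uniformly at random.
Set $\delta_1 = \mathrm{diam}(\mathcal{K})/2$, $\tilde{p} = p/M$, and $M = \log_2 \frac{1}{\alpha_0 \varepsilon}$, where $\alpha_0 = \min\left\{\frac{1}{2L\,\mathrm{diam}(\mathcal{K})}, \frac{2\sqrt{2}}{\sqrt{\sigma}\,\mathrm{diam}(\mathcal{K})}\right\}$
for $m = 1$ to $M$ do
    // Perform SGD over $\hat{f}_{\delta_m}$
    Set $\varepsilon_m := \sigma\delta_m^2/32$, and

    $$T_F = \frac{12480 d^2C^2}{\sigma\varepsilon_m\delta_m^2} \log\left(\frac{2}{\tilde{p}} + 2\log\frac{12480 d^2C^2}{\sigma\varepsilon_m\delta_m^2}\right)$$

    Set the shrunken decision set,

    $$\mathcal{K}_m := \mathcal{K} \cap \mathbb{B}(\bar{x}_m, 1.5\delta_m)$$

    Set the gradient oracle for $\hat{f}_{\delta_m}$,

    GradOracle($\cdot$) = SGO_V($\cdot$, $\delta_m$)

    Update:

    $\bar{x}_{m+1} \leftarrow$ Suffix-SGD($T_F$, $\mathcal{K}_m$, $\bar{x}_m$, GradOracle)

    $\delta_{m+1} = \delta_m/2$
end for
Return: $\bar{x}_{M+1}$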


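Proof of Theorem 6.1. Let $\bar{x}_{M+1}$ be the output of Algorithm 3. Similarly to the proof of
Theorem 5.1, we can show that for every $x \in \mathcal{K}$:

$$f(\bar{x}_{M+1}) - f(x) \le \varepsilon$$

Let $T_{\mathrm{total}}$ be the total number of queries made by Algorithm 3; then we have:

$$T_{\mathrm{total}} \le \sum_{m=1}^{M} \frac{12480 d^2C^2}{\sigma\varepsilon_m\delta_m^2} \log\Gamma$$
$$= \sum_{m=1}^{M} \frac{12480 d^2C^2}{\sigma(\sigma\delta_m^2/32)\delta_m^2} \log\Gamma$$
$$\le \frac{4 \cdot 10^5 d^2C^2 \log\Gamma}{\sigma^2\delta_1^4} \sum_{i=1}^{M} 16^{i-1}$$
$$\le \frac{6 \cdot 10^4 d^2C^2 \log\Gamma}{\sigma^2\delta_1^4}\, 16^M$$
$$\le \frac{6 \cdot 10^4 d^2C^2 \log\Gamma}{\sigma^2} \max\{256L^4, \sigma^2/4\}\, \frac{1}{\varepsilon^4}$$

Here we used the notation:

$$\Gamma := \frac{2}{\tilde{p}} + 2\log(12480 d^2C^2/\sigma\varepsilon_M\delta_M^2)$$
$$\le \frac{2M}{p} + 2\log\left(4 \cdot 10^5 d^2C^2 \max\{256L^4, \tfrac{\sigma^2}{4}\}/\sigma^2\varepsilon^4\right)$$

□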


7 Experiments 

In the last two decades, performing complex learning tasks using Neural-Network (NN)
architectures has become an active and promising line of research. Since learning NN architectures
essentially requires solving a hard non-convex program, we have decided to focus
our empirical study on this type of task. As a test case, we train a NN with a single hidden
layer of 30 units over the MNIST data set. We adopt the experimental setup of Dauphin
et al. (2014) and train over a down-scaled version of the data, i.e., the original $28 \times 28$ images
of MNIST were down-sampled to the size of $10 \times 10$. We use a ReLU activation function,
and minimize the square loss.

7.1 Smoothing the NN 

First, we were interested in exploring the non-convex structure of the above NN learning
task, and in checking whether our definition of $\sigma$-nice complies with this structure. We started by
running MSGD (Minibatch Stochastic Gradient Descent) on the problem, using a batch
size of 100 and a step size rule of the form $\eta_t = \eta_0(1 + \gamma t)^{-3/4}$, where $\eta_0 = 0.01$, $\gamma = 10^{-4}$.
This choice of step size rule was the most effective among a grid of rules that we examined.
We found that MSGD frequently “stalls” in areas with a relatively high loss; we
refer to the points reached at the end of such runs as stall-points.
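
The MSGD step size rule stated above can be written as a small helper. The function name `msgd_step_size` is an assumption for this sketch; the constants are the ones reported in the text.

```python
def msgd_step_size(t, eta0=0.01, gamma=1e-4):
    """Step size eta_t = eta0 * (1 + gamma * t)^(-3/4) at iteration t."""
    return eta0 * (1.0 + gamma * t) ** (-0.75)


# Step sizes at a few iteration counts.
schedule = [msgd_step_size(t) for t in (0, 10_000, 50_000)]
```

With these constants the step size stays close to $\eta_0$ for the first few thousand iterations and decays polynomially afterwards.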

Figure 4: The objective near a stall point. Left: $\delta = 0$. Middle: $\delta = 3$. Right: $\delta = 7$.


In order to learn about the non-convex nature of the problem, we examined the objective 
values along two directions around stall-points. The first direction was the gradient at the 
stall point, and the second direction was the line connecting the stall-point to x*, where x* 
is the best weights configuration of the NN that we were able to find. A drawing depicting 
typical results appears on the left side of Figure 4. The stall-point appears in red, and x* 
in green; also the axis marked as X is the gradient direction, and one marked Y is the 
direction between stall-point and x*. Note that the stall-point is inside a narrow “valley”, 
which prevents MSGD from “seeing” x*, and so it seems that MSGD slowly progresses 
downstream. Interestingly, the objective around x* seems strongly-convex in the direction 
of the stall point. 

In the middle of Figure 4, we present the $\delta = 3$ smoothed version of the same objective
that appears on the left side of Figure 4. We can see that the “valley” has not vanished,
but the gradient of the smoothed version leads us slightly towards $x^*$ and out of the original
“valley”. On the right side of Figure 4, we present the $\delta = 7$ smoothed version of the
objective. We can see that due to the coarse smoothing, the “valley” in which MSGD was
stalled has completely dissolved, and the gradient of this version leads us towards $x^*$.

7.2 Graduated Optimization of NN 

Here we present experiments that demonstrate the effectiveness of GradOpt_G (Algorithm 1)
in training the NN mentioned above. First, we wanted to learn if smoothing can help us
escape points where MSGD stalls. We used MSGD ($\delta = 0$) to train the NN, and as before
we found that its progress slows down, yielding relatively high error. We then took the point
that MSGD reached after $5 \cdot 10^4$ iterations and initialized an optimization over the smoothed
versions of the loss; this was done using smoothing values of $\delta \in \{1, 3, 5, 7\}$. In Figure 5 we
present the results of the above experiment.

As seen in Figure 5, small $\delta$’s converge more slowly than large $\delta$’s, but produce a much better
solution. Furthermore, the initial optimization progresses in leaps; for large $\delta$’s the leaps are
sharper, and lower $\delta$’s demonstrate smaller leaps. We believe that these leaps are associated
with the advance of the optimization from one local “valley” to another. Larger values of $\delta$
dissolve the “valleys” much more easily, but converge to points with higher errors than small $\delta$’s,
due to the increase of the bias with smoothing.

Figure 5: Running optimization with fixed smoothing values, starting at the point where
MSGD got stuck after $5 \cdot 10^4$ iterations.

In Figure 6 we compare our complete graduated optimization algorithm, namely GradOpt_G
(Alg. 1), to MSGD. We started with an initial smoothing of $\delta = 7$, which decayed according
to GradOpt_G. Note that GradOpt_G progresses very fast and yields a much better solution
than MSGD.



Figure 6: Comparison between MSGD and GradOpt G . 


8 Discussion 

We have described a family of non-convex functions which admit efficient optimization via
the graduated optimization methodology, and gave the first rigorous analysis of a first-order
algorithm in the stochastic setting.

We view this as only a first glimpse of the potential of graduated optimization for provable
non-convex optimization. Among the interesting questions that remain are the following:

• Is $\sigma$-niceness necessary for convergence of first-order methods to a global optimum? Is
there a more lenient property that better captures the power of graduated optimization?

• Among the two properties of $\sigma$-niceness, can their parameters be relaxed in terms of
the ratio of smoothing to strong-convexity, or to centering?

• Can second-order/other methods give rise to better convergence rates / faster algorithms
for stochastic or offline $\sigma$-nice non-convex optimization?